Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.7 - Equality Statements in Spark and Comparison with Nulls")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
 | id | breed_id | nickname | birthday | age | color |
---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
Filtering Data
When you want to filter data with more than one expression, there are a couple of gotchas that you will need to be careful of.
Case 1: Multiple Conditions
(
    pets
    .where(
        (F.col('breed_id') == 1) &
        (F.col('color') == 'brown') &
        (F.col('color').isin('brown'))
    )
    .toPandas()
)
 | id | breed_id | nickname | birthday | age | color |
---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
What Happened?
When there is more than one condition, you will need to wrap each condition in () brackets, and use Python's bitwise operators (&, |, ~) instead of the logical operators (and, or, not).
Why?
This is because Python provides no way to overload the logical operators and, or and not: they work through __bool__, which must return a plain True or False. Spark's Column therefore raises an error when truth-tested, and the Spark developers overloaded the bitwise operators &, | and ~ instead to build column expressions.
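To make both rules concrete, here is a small sketch of the failure modes, with the broken variants left commented out:

# 1. Python's `and` truth-tests the left Column, and bool() on a Column
#    raises "ValueError: Cannot convert column into bool ...".
# pets.where((F.col('breed_id') == 1) and (F.col('color') == 'brown'))

# 2. Without brackets, `&` binds tighter than `==`, so this parses as
#    F.col('breed_id') == (1 & F.col('color')) == 'brown' and fails too.
# pets.where(F.col('breed_id') == 1 & F.col('color') == 'brown')

# Correct: brackets around each condition, `&` between them.
pets.where((F.col('breed_id') == 1) & (F.col('color') == 'brown'))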
Case 2: Nested Conditions
(
    pets
    .where(
        (
            F.col('breed_id').isin([1, 2]) &
            F.col('breed_id').isNotNull()
        ) |
        (F.col('color') == 'white')
    )
    .toPandas()
)
 | id | breed_id | nickname | birthday | age | color |
---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
1 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
2 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
What Happened?
Similar to before, nested conditions need to be wrapped in () brackets as well.
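As a side note, where() is an alias of filter() and also accepts a SQL expression string, so if the brackets get hard to read, the same nested filter can be sketched as:

(
    pets
    # Same logic as above expressed as one SQL string, which
    # sidesteps the bracket and bitwise-operator rules entirely.
    .where("(breed_id IN (1, 2) AND breed_id IS NOT NULL) OR color = 'white'")
    .toPandas()
)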
Case 3: Equality Statements with Null Values (use isNotNull() and isNull())
(
    pets
    .withColumn('result', F.col('color') != 'white')
    .withColumn(
        'result_2',
        (F.col('color') != 'white') &
        (F.col('color').isNotNull())
    )
    .toPandas()
)
 | id | breed_id | nickname | birthday | age | color | result | result_2 |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | True | True |
1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | None | False |
2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | None | False |
3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white | False | False |
What Happened?
If you do not come from a SQL background, this can catch you off guard: any comparison with a Null value evaluates to Null rather than True or False, unless you specifically use the Null comparisons. The two Null comparisons are isNotNull() and isNull(). This is why result is Null for Argus and Chewie, while result_2, which guards against Null explicitly, is False.
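On a related note, Spark also provides a null-safe equality comparison, Column.eqNullSafe() (available in Spark 2.3+), which never returns Null; a minimal sketch:

(
    pets
    # eqNullSafe treats Null as an ordinary value:
    # Null <=> 'white' is False and Null <=> Null is True,
    # so is_white is always True or False, never Null.
    .withColumn('is_white', F.col('color').eqNullSafe('white'))
    .toPandas()
)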
Summary
In Spark, when using a more involved conditional expression, you will need to wrap each condition in () brackets and use the bitwise operators in Python. Be explicit when you're performing conditional transformations on columns that can be Null.